110 research outputs found

Split-alignment of genomes finds orthologies more accurately

Author: A Kuzniar
AE Darling
AK Hudek
AM Altenhoff
B Paten
CN Dewey
D Earl
DJ States
E Passarge
E Sonnhammer
F Chiaromonte
G Lunter
I Dubchak
M Hou
M Nánási
Martin C Frith
MC Frith
MJ Chaisson
O Gotoh
P Berman
PJ Hastings
R Durbin
R Lopez
Risa Kawaguchi
S Kuraku
S Möller
S Schwartz
S Sheetlin
SF Altschul
SM Kiełbasa
TF Smith
TJ Treangen
WJ Kent
YK Yu
Z Zhang
Publication venue: Springer Science and Business Media LLC

Optimizing substitution matrix choice and gap parameters for sequence alignment

Author: CB Do
CN Dewey
D Gusfield
DT Jones
E Kim
G Blackshields
GA Price
GH Gonnet
I Van Walle
J Flannick
J Kececioglu
J Pei
JD Thompson
JG Henikoff
K Katoh
M Box
MA Larkin
MO Dayhoff
MP Styczynski
MS Waterman
O Chapelle
RC Edgar
Robert C Edgar
S Henikoff
T Lassmann
T Muller
TM Phuong
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
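To illustrate where a substitution score and gap penalties enter a pair-wise global alignment, the following Python sketch computes the best global alignment score. It is not POP and does not reproduce the paper's parameters. It assumes an affine gap model in which a gap of length k costs gap_open + k * gap_extend, and it uses a toy match/mismatch function (toy_score) in place of a real matrix such as VTML200. All function and parameter names are hypothetical.

```python
NEG_INF = float("-inf")


def global_affine_score(a, b, score, gap_open, gap_extend):
    """Best global alignment score of a and b under affine gap penalties.

    Assumption for this sketch: a gap of length k costs
    gap_open + k * gap_extend. `score(x, y)` plays the role of the
    substitution matrix.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend any state with a substitution (match or mismatch).
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) \
                + score(a[i - 1], b[j - 1])
            # Open a new gap (pay open + extend) or extend an existing one.
            X[i][j] = max(M[i - 1][j] - gap_open - gap_extend,
                          Y[i - 1][j] - gap_open - gap_extend,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open - gap_extend,
                          X[i][j - 1] - gap_open - gap_extend,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


def toy_score(x, y):
    """Toy stand-in for a substitution matrix: +2 match, -1 mismatch."""
    return 2 if x == y else -1
```

Swapping toy_score for a lookup into a real substitution matrix, and varying gap_open and gap_extend, gives the kind of parameter space that an optimization procedure would search.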

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement

Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46Mbp conserved among all taxa and a pan-genome of 15.2Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss.
Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriaceae may exhibit regulatory divergence. The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

OPUS - University of Technology Sydney

PubMed Central

Meta-Alignment with Crumble and Prune: Partitioning very large alignment problems for performance and parallelization

Author: A Siepel
AS Schwartz
B Paten
B Rhead
Benedict Paten
C Lee
CN Dewey
David Haussler
DF Feng
G Myers
I Lumb
J Ma
JE Stajich
JS Pedersen
K Katoh
K Kryukov
K Liu
K Reinert
KM Roskin
Krishna M Roskin
M Blanchette
M Hasegawa
M Waterman
N Bray
P Di Tommaso
RC Edgar
RK Bradley
S Griffiths-Jones
S Schwartz
T Kim
U Tönges
W Gentzsch
WJ Kent
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately these new algorithms often require large amounts of time and memory to run, making it nearly impossible to run these algorithms on large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in terms of sequence length and in sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently. The results are then combined to form a solution to the full alignment problem. Results: Crumble and Prune each provide a significant performance improvement with little loss of accuracy. In some cases, a gain in accuracy was observed. Crumble and Prune were tested on real and simulated data. Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved in parallel on a compute cluster, significantly shortening the run-time. Conclusions: These methods enabled us to solve gigabase alignment problems. These methods could enable a new generation of biologically realistic alignment algorithms to be applied to real world, large scale alignment problems.
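The idea of cutting a long alignment problem into shorter, independent sub-problems and then combining the results can be pictured with the pair-wise Python sketch below. It is only an illustration under strong assumptions: the cut points are taken as given (the abstract does not say how Crumble chooses them), only two sequences are handled, and the block aligner is a placeholder. The names split_at_cut_points, align_in_blocks and pad_aligner are hypothetical.

```python
def split_at_cut_points(a, b, cut_points):
    """Cut a and b into paired blocks.

    cut_points: list of (i, j) positions, increasing in both coordinates,
    meaning "a[:i] is aligned against b[:j]". Assumed to be known already.
    """
    blocks = []
    prev_i = prev_j = 0
    for i, j in list(cut_points) + [(len(a), len(b))]:
        blocks.append((a[prev_i:i], b[prev_j:j]))
        prev_i, prev_j = i, j
    return blocks


def align_in_blocks(a, b, cut_points, align_block):
    """Align each block independently, then concatenate the block alignments."""
    rows_a, rows_b = [], []
    for sub_a, sub_b in split_at_cut_points(a, b, cut_points):
        row_a, row_b = align_block(sub_a, sub_b)  # any existing aligner
        rows_a.append(row_a)
        rows_b.append(row_b)
    return "".join(rows_a), "".join(rows_b)


def pad_aligner(x, y):
    """Placeholder block aligner: pads the shorter block with trailing gaps."""
    width = max(len(x), len(y))
    return x.ljust(width, "-"), y.ljust(width, "-")
```

Because each block is aligned without reference to the others, the calls to align_block could be distributed across workers, which is the property the abstract highlights.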

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Optimality regions and fluctuations for Bernoulli last passage models

Author: A-L Basdevant
AS Malaspinas
C Houdré
C Vinzant
CA Tracy
CN Dewey
D Fernández-Baca
D Gusfield
D Maier
DS Hirschberg
EW Myers
H Cramér
J Komlós
J Lember
Janosch Ortmann
JB Martin
L Bergroth
L Pachter
M Kiwi
M Vingron
N Georgiou
N O’Connell
Nicos Georgiou
PC Ng
PW Glynn
S Aluru
S Amsalu
S Henikoff
SB Needleman
T Bodineau
T Seppäläinen
TF Smith
V Chvátal
V Hower
VB Priezzev
WJ Masek
X Xia
Y Baryshnikov
Publication venue: Springer Science and Business Media LLC
Publication date: 13/03/2018

We study the sequence alignment problem and its independent version, the discrete Hammersley process with an exploration penalty. We obtain rigorous upper bounds for the number of optimality regions in both models near the soft edge. At zero penalty the independent model becomes an exactly solvable model and we identify cases for which the law of the last passage time converges to a Tracy-Widom law.
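The link between sequence alignment and last passage can be made concrete with a small Python sketch. It covers only the zero-penalty case, where the alignment score is taken to be the length of the longest common subsequence. That length equals the largest number of 1-sites collected along a path that increases strictly in both coordinates through a 0/1 grid, with site (i, j) equal to 1 when the two letters match. In the independent version the grid entries would instead be i.i.d. Bernoulli variables. The function names are hypothetical, and the exploration penalty studied in the paper is not modelled here.

```python
def last_passage_time(weights):
    """Max number of 1-sites on a strictly increasing path through a 0/1 grid."""
    n = len(weights)
    m = len(weights[0]) if n else 0
    # L[i][j]: best value using only rows < i and columns < j.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = max(L[i - 1][j],                             # skip row i
                          L[i][j - 1],                             # skip column j
                          L[i - 1][j - 1] + weights[i - 1][j - 1]) # use site (i, j)
    return L[n][m]


def match_grid(x, y):
    """0/1 environment induced by two sequences: 1 where letters agree."""
    return [[1 if cx == cy else 0 for cy in y] for cx in x]


def lcs_length(x, y):
    """Longest common subsequence length as a last passage time."""
    return last_passage_time(match_grid(x, y))
```

For example, lcs_length("AGGTAB", "GXTXAYB") evaluates to 4, corresponding to the common subsequence GTAB.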

arXiv.org e-Print Archive

Crossref

Sussex Research Online

Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Author: A Delcher
A Smit
AC Darling
B Ma
C Kemena
CN Dewey
Darren P. Martin
DR Bentley
E Ohlebusch
EJ Vallender
FP Preparata
G Bejerano
G Bourque
Hachiya Tsuyoshi
I Tabus
JT Simpson
K Liolios
K Mathee
Kris Popendorf
LB Kish
M Blanchette
M Brudno
M Farach
P Pevzner
Pearson
R Rivest
RA Gibbs
RH Waterston
S Quinlan
S Schwartz
SF Altschul
T Hachiya
T Hubbard
TF Smith
W Miller
Y Osana
Yasubumi Sakakibara
Yasunori Osana
Publication venue: Public Library of Science
Publication date: 24/09/2010

BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grows. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net
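The spaced-seed idea mentioned in the abstract can be sketched in a few lines of Python. A pattern such as "101" means that the first and third positions of a window must match while the middle one is ignored. The sketch indexes every window of every sequence by its spaced key and reports keys that occur in all sequences. It uses an ordinary Python dictionary rather than Murasaki's adaptive hash functions, and the names spaced_key and find_anchors are hypothetical.

```python
from collections import defaultdict


def spaced_key(seq, start, pattern):
    """Letters of the window at `start` that sit under a '1' in the pattern."""
    return "".join(seq[start + k] for k, c in enumerate(pattern) if c == "1")


def find_anchors(seqs, pattern):
    """Map each spaced key present in every sequence to its positions.

    Returns {key: {sequence_index: [start positions]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for seq_id, seq in enumerate(seqs):
        for start in range(len(seq) - len(pattern) + 1):
            index[spaced_key(seq, start, pattern)][seq_id].append(start)
    # Keep only keys seen in all input sequences.
    return {key: dict(hits) for key, hits in index.items()
            if len(hits) == len(seqs)}
```

With the pattern "101", the windows ACG and AGG both map to the key AG, so a single substitution at the ignored position does not prevent the two windows from being reported as a candidate anchor.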

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Fast Statistical Alignment

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Caltech Authors

Parameters for accurate genome alignment

Author: A Morgulis
A Schwartz
A Stark
B Paten
CH Yuh
CN Dewey
D Gusfield
D Karolchik
D States
DA Pollard
E Kim
EH Margulies
F Chiaromonte
G Benson
G Lunter
I Holmes
J Ruan
J Wang
JC Wootton
JE Janecka
JO Kriegs
JT Reese
KD Pruitt
KM Wong
LA Newberg
LE Carvalho
M Brudno
M Hamada
Martin C Frith
MC Frith
Michiaki Hamada
MS Waterman
Paul Horton
PP Gardner
R Durbin
RC Friedman
RK Bradley
S Karlin
S Kumar
S Miyazawa
S Schwartz
S Sheetlin
SF Altschul
TJ Treangen
W Huang
WJ Kent
YK Yu
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results: We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions: These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Evolutionary Modeling and Prediction of Non-Coding RNAs in Drosophila

Author: A Siepel
A Stark
A Varadarajan
AG Clark
Andrew V. Uzilov
B Knudsen
B Paten
CN Dewey
D Rose
D St Johnston
DP Bartel
DS Parker
E Boyle
E Lécuyer
E Nawrocki
E Rivas
E Torarinsson
G McGuire
Ian Holmes
IL Hofacker
J Brennecke
J Pedersen
J Ruby
JL Thorne
JP Bachellerie
JR Manak
JS Pedersen
KS Pollard
Lars Barquist
M Crosby
M Mandal
M Pheasant
M Sprinzl
Mitchell E. Skinner
N Bray
N Goldman
PD Rijk
PS Klosterman
RD Dowell
RK Bradley
Robert Belshaw
Robert K. Bradley
S Griffiths-Jones
S Washietl
T Babak
T Elgavish
T Gesell
TM Lowe
V Ambros
WJ Bruno
YR Bendana
Yuri R. Bendaña
Z Wang
Z Yang
Publication venue: Public Library of Science
Publication date: 01/01/2009

We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3′ end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Genetic polymorphisms in DNA repair and damage response genes and late normal tissue complications of radiotherapy for breast cancer

Breast-conserving surgery followed by radiotherapy is effective in reducing recurrence; however, telangiectasia and fibrosis can occur as late skin side effects. As radiotherapy acts through producing DNA damage, we investigated whether genetic variation in DNA repair and damage response confers increased susceptibility to develop late normal skin complications. Breast cancer patients who received radiotherapy after breast-conserving surgery were examined for late complications of radiotherapy after a median follow-up time of 51 months. Polymorphisms in genes involved in DNA repair (APEX1, XRCC1, XRCC2, XRCC3, XPD) and damage response (TP53, P21) were determined. Associations between telangiectasia and genotypes were assessed among 409 patients, using multivariate logistic regression. A total of 131 patients presented with telangiectasia and 28 patients with fibrosis. Patients with variant TP53 genotypes either for the Arg72Pro or the PIN3 polymorphism were at increased risk of telangiectasia. The odds ratios (OR) were 1.66 (95% confidence interval (CI): 1.02–2.72) for 72Pro carriers and 1.95 (95% CI: 1.13–3.35) for PIN3 A2 allele carriers compared with non-carriers. The TP53 haplotype containing both variant alleles was associated with almost a two-fold increase in risk (OR 1.97, 95% CI: 1.11–3.52) for telangiectasia. Variants in the TP53 gene may therefore modify the risk of late skin toxicity after radiotherapy

Crossref

PubMed Central